MSI-X – the right way to spread interrupt load
When considering ways to spread interrupts from one device among multiple cores, I can't avoid mentioning MSI-X. The thing is, MSI-X is actually the right way to do the job.
Interrupt affinity, which I discussed here and here, has a fundamental problem: inevitable CPU cache misses. To see why, think about what happens when your computer receives a packet from the network. The packet belongs to some connection. With interrupt affinity the packet may land on core X, while chances are that the previous packet of the same TCP connection landed on core Y (X ≠ Y).
Handling the packet requires the kernel to load the TCP connection object into X's cache. But this is so inefficient. After all, the TCP connection object is already in Y's cache. Wouldn't it be better to handle the second packet on core Y as well?
This is the problem with interrupt affinity. On the one hand, we want to spread interrupts to even out the load on the cores. On the other hand, simple round robin isn't enough. The little fella that decides where each interrupt goes should be able to look into the packet and, depending on which TCP connection it belongs to, send the interrupt to the core that handles all packets belonging to that connection.
Ideally, NICs should be able to:
- Look into packets and identify connections.
- Direct the interrupt to the core that handles the connection.
Apparently, this functionality is already here. Devices that support MSI-X do exactly this.
Meet MSI-X
MSI-X is an extension to MSI. MSI replaces the good old pin-based interrupt delivery mechanism.
Each IO-APIC chip (x86 permits up to five of them) has 24 legs, each connected to one or more devices. When the IO-APIC receives an interrupt, it redirects it to one of the local-APICs. Each local-APIC is connected to a core, which receives the interrupt.
MSI provides a kind of protocol for interrupt delivery. Instead of raising a signal on pins, PCI cards send a message, which gets translated into the right interrupt. Theoretically, this means that each device can have a number of interrupt vectors. In reality, plain MSI does not support this, but MSI-X does.
Modern high-end network cards that support MSI-X implement multiple rx/tx queues. Each queue is tied to an interrupt vector, and each NIC has plenty of them. I checked Intel's 82575 chipset: with the igb driver compiled properly, it has up to eight queues, four rx and four tx. Broadcom's 5709 chipset provides eight queues (and eight interrupt vectors), each handling both rx and tx.
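A quick way to see those per-queue vectors on a live system is to look at /proc/interrupts, where an MSI-X capable NIC typically shows one line per queue (for example eth0-rx-0, eth0-tx-0; the exact names vary by driver). The small userspace sketch below is my own illustration, not part of the original setup; it simply filters /proc/interrupts for lines mentioning a given interface name, with eth0 as an assumed default.

```c
/* list_queue_irqs.c -- rough sketch: print IRQs whose /proc/interrupts line
 * mentions the given interface name (e.g. "eth0"). On an MSI-X capable NIC
 * each rx/tx queue usually shows up as its own vector (eth0-rx-0, ...).
 */
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    const char *ifname = (argc > 1) ? argv[1] : "eth0"; /* assumed default */
    FILE *f = fopen("/proc/interrupts", "r");
    char line[1024];

    if (!f) {
        perror("/proc/interrupts");
        return 1;
    }

    while (fgets(line, sizeof(line), f)) {
        if (strstr(line, ifname)) {
            int irq;
            /* each interrupt line starts with "  <irq>:" */
            if (sscanf(line, " %d:", &irq) == 1)
                printf("IRQ %d: %s", irq, line);
        }
    }

    fclose(f);
    return 0;
}
```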
In kernel 2.6.24, kernel developers introduced a new member of struct sk_buff called queue_mapping. This member tells the NIC driver which queue to use when transmitting the packet.
Before transmitting a packet, the kernel decides which queue to use for it (net/core/dev.c:dev_queue_xmit()). It uses two techniques to do so. First, the kernel can ask the NIC driver to provide a queue number for the packet. This functionality, however, is optional in NIC drivers, and at the moment neither the Intel nor the Broadcom driver provides it. Otherwise, the kernel uses a simple hashing algorithm that produces a 16-bit number from the two IP addresses and (in case of TCP or UDP) the two port numbers. All this happens in a function named simple_tx_hash() in net/core/dev.c.
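To make the idea concrete, here is a minimal userspace sketch of that flow-to-queue mapping. It is my own simplification, not the kernel's actual simple_tx_hash() code: toy_skb, toy_flow_hash() and NUM_TX_QUEUES are made-up names, and the bit mixing only imitates the real hashing. What matters is the invariant it demonstrates: packets with the same address/port 4-tuple always end up with the same queue_mapping value, and therefore on the same queue and interrupt vector.

```c
/* toy_tx_hash.c -- simplified illustration of picking a tx queue for a flow.
 * This is NOT the kernel's simple_tx_hash(); it only mimics the idea:
 * same 4-tuple -> same 16-bit hash -> same queue -> same interrupt vector.
 */
#include <stdint.h>
#include <stdio.h>

#define NUM_TX_QUEUES 4   /* e.g. what an igb-driven 82575 might expose */

struct toy_skb {
    uint32_t saddr, daddr;   /* IPv4 addresses */
    uint16_t sport, dport;   /* TCP/UDP ports  */
    uint16_t queue_mapping;  /* chosen tx queue, like skb->queue_mapping */
};

/* crude 16-bit mix of the 4-tuple; stands in for the kernel's real hash */
static uint16_t toy_flow_hash(const struct toy_skb *skb)
{
    uint32_t h = skb->saddr ^ skb->daddr;
    h ^= ((uint32_t)skb->sport << 16) | skb->dport;
    h = (h >> 16) ^ (h & 0xffff);
    return (uint16_t)h;
}

static void toy_select_queue(struct toy_skb *skb)
{
    skb->queue_mapping = toy_flow_hash(skb) % NUM_TX_QUEUES;
}

int main(void)
{
    struct toy_skb a = { 0xc0a80001, 0x0a000001, 45678, 80, 0 };
    struct toy_skb b = a;   /* another packet of the same connection */

    toy_select_queue(&a);
    toy_select_queue(&b);

    /* both packets of the flow land on the same queue */
    printf("queue for packet 1: %u, packet 2: %u\n",
           (unsigned)a.queue_mapping, (unsigned)b.queue_mapping);
    return 0;
}
```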
When receiving packets, things are even easier, because the NIC firmware and the driver decide which queue to use to hand the packet to the kernel.
Using this simple technique, the kernel and modern NICs can make sure that packets belonging to a certain connection land on a certain queue. Using interrupt affinity binding techniques, you can bind a certain interrupt vector to a certain core (by writing to smp_affinity, etc.). Thus you can spread interrupts among multiple cores and still make sure there are no cache misses.
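As a concrete illustration of that last step, the per-vector affinity lives in /proc/irq/&lt;IRQ&gt;/smp_affinity, the same interface covered in the interrupt affinity posts. The sketch below is an assumed helper of mine, not something from the original article: it writes a hexadecimal CPU mask (for example 2 for CPU 1) to the affinity file of a given IRQ number, which you would take from /proc/interrupts.

```c
/* pin_irq.c -- sketch: pin an interrupt vector to a set of CPUs by writing
 * a hex CPU mask to /proc/irq/<irq>/smp_affinity (e.g. mask 2 = CPU 1).
 * Usage: ./pin_irq <irq> <hexmask>   (run as root)
 */
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    char path[64];
    FILE *f;

    if (argc != 3) {
        fprintf(stderr, "usage: %s <irq> <hexmask>\n", argv[0]);
        return 1;
    }

    snprintf(path, sizeof(path), "/proc/irq/%d/smp_affinity", atoi(argv[1]));
    f = fopen(path, "w");
    if (!f) {
        perror(path);
        return 1;
    }

    /* the kernel expects a hexadecimal bitmask of allowed CPUs */
    fprintf(f, "%s\n", argv[2]);
    fclose(f);

    printf("wrote mask %s to %s\n", argv[2], path);
    return 0;
}
```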
Hi Alex!
Once again, here's a nice article… Thanks for sharing your knowledge.
For outbound packets, the kernel builds a hash based on IP addresses and port numbers (source and destination, I suppose?) in order to bind the corresponding flow to a given TX queue. I was wondering if the hash is built in the same manner for inbound packets / RX queues?
What I understand is that the driver is in charge of binding a given ingress flow to a given RX queue. Does that mean that the sysadmin cannot configure it a posteriori (with ethtool, for instance)?
“Using interrupt affinity binding techniques, you can bind a certain interrupt vector to a certain core”: can you please give us further details on how to set that up? Does that mean that each queue will appear as a particular device under /proc/interrupts?
Finally, did you hear about the TNAPI and PF_RING patches by Luca Deri (http://www.ntop.org/TNAPI.html)? If the MSI-X feature is already implemented in the drivers concerned (Intel igb, ixgbe), I don't see what the benefit of the TNAPI patch is. What is your opinion about this?
Telenn
MSI-X is great, and is also now used by default in the Linux kernel, 2.6.33 and onwards!
No more sharing IRQs on my laptop.
Great article!
@ninez
Thanks for sharing your experience and for the warm comment. Please come again!
Great article, Alex, thanks a lot!
Quite an informative article. Thanks!
While the network stack has been modified to take advantage of the multiple queues provided by devices, is there something similar planned for storage-side traffic as well?
Something similar to what other OSes (VMware, Windows) have to offer.